An optimal parallel connectivity algorithm
A synchronized parallel algorithm of depth O(n^2/p) for p (≤ n^2/log^2 n) processors is given for the problem of computing the connected components of an undirected graph. The speed-up of this algorithm is optimal in the sense that the depth of the algorithm is of the order of the running time of the fastest known sequential algorithm divided by the number of processors used.
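As a rough illustration of parallel connectivity, the following is a minimal sketch of the classic hooking and pointer-jumping approach, executed here serially; the function name and structure are illustrative assumptions, not the paper's algorithm. On a PRAM, each hooking pass and each pointer-jumping pass runs over all edges or vertices concurrently.

```python
# Hypothetical sketch of connected components via repeated hooking and
# pointer jumping, the core idea behind many PRAM connectivity algorithms.
# This is an illustrative serial emulation, not the paper's algorithm.

def connected_components(n, edges):
    parent = list(range(n))          # each vertex starts as its own root

    def root(v):
        while parent[v] != v:
            v = parent[v]
        return v

    changed = True
    while changed:
        changed = False
        # Hooking: attach the larger root under the smaller one.
        for u, v in edges:
            ru, rv = root(u), root(v)
            if ru != rv:
                parent[max(ru, rv)] = min(ru, rv)
                changed = True
        # Pointer jumping: halve tree depth (done concurrently on a PRAM).
        for v in range(n):
            parent[v] = parent[parent[v]]

    return [root(v) for v in range(n)]

# Two components: {0, 1, 2} and {3, 4}
labels = connected_components(5, [(0, 1), (1, 2), (3, 4)])
```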
Empirical Challenge for NC Theory
Horn-satisfiability or Horn-SAT is the problem of deciding whether a
satisfying assignment exists for a Horn formula, a conjunction of clauses each
with at most one positive literal (also known as Horn clauses). It is a
well-known P-complete problem, which implies that unless P = NC, it is a hard
problem to parallelize. In this paper, we empirically show that, under a known
simple random model for generating the Horn formula, the ratio of
hard-to-parallelize instances (closer to the worst-case behavior) is
infinitesimally small. We show that the depth of a parallel algorithm for
Horn-SAT is polylogarithmic on average, for almost all instances, while keeping
the work linear. This challenges theoreticians and programmers to look beyond
worst-case analysis and come up with practical algorithms coupled with
respective performance guarantees.
Comment: 10 pages, 5 figures. Accepted at HOPC'2
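As context for the propagation rounds discussed above, here is a minimal serial sketch of Horn-SAT decided by unit propagation; the parallel algorithm applies all ready unit clauses within a single round, so the number of rounds corresponds to the depth. The clause encoding — (head, body) pairs, with head None for purely negative clauses — is an assumption for illustration only.

```python
# Minimal serial sketch of Horn-SAT via unit propagation. A Horn clause
# (¬a ∨ ¬b ∨ h) is encoded as (h, {a, b}): "if a and b hold, h holds";
# a clause with no positive literal is encoded with head None.
# This encoding is an illustrative assumption, not the paper's.

def horn_sat(clauses):
    true_vars = set()
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            if body <= true_vars:            # all premises already forced
                if head is None:
                    return False             # purely negative clause violated
                if head not in true_vars:
                    true_vars.add(head)      # one unit propagated
                    changed = True
    return True                              # minimal model satisfies all clauses

# (x) ∧ (x → y) ∧ (y → z) ∧ (¬z) is unsatisfiable:
assert horn_sat([("x", set()), ("y", {"x"}), ("z", {"y"}), (None, {"z"})]) is False
```

The parallel version would propagate every currently ready unit clause in the same round; the paper's empirical claim is that, for almost all random instances, the number of such rounds is polylogarithmic.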
Project for Developing Computer Science Agenda(s) for High-Performance Computing: An Organizer's Summary
Designing a coherent agenda for the implementation of the High
Performance Computing (HPC) program is a nontrivial technical challenge.
Many computer science and engineering researchers in the area of HPC, who
are affiliated with U.S. institutions, have been invited to contribute
their agendas. We have made a considerable effort to give many in that
research community the opportunity to write a position paper. This
explains why we view the project as placing a mirror in front of the
community, and hope that the mirror indeed reflects many of the opinions
on the topic.
The current paper is an organizer's summary and represents his reading
of the position papers. This summary is his sole responsibility. It is
respectfully submitted to the NSF.
(Also cross-referenced as UMIACS-TR-94-129.)
Granularity of parallel memories
Consider algorithms which are designed for shared memory models of parallel computation in which processors are allowed to have fairly unrestricted access patterns to the shared memory. General fast simulations of such algorithms by parallel machines in which the shared memory is organized in modules where only one cell of each module can be accessed at a time are proposed. The paper provides a comprehensive study of the problem. The solution involves three stages:
(a) Before a simulation, distribute randomly the memory addresses among the memory modules.
(b) Keep several copies of each address and assign memory requests of processors to the "right" copies at any time.
(c) Satisfy these assigned memory requests according to the specifications of the parallel machine.
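A hedged sketch of stages (a)-(c): each address is hashed to a few module copies, and each request is served by the currently least-loaded copy. The copy count, hash construction, and greedy assignment below are illustrative assumptions, not the paper's simulation scheme.

```python
# Illustrative sketch of randomized address-to-module mapping with copies.
# NUM_MODULES, NUM_COPIES, and the salted-hash construction are assumptions.
import random

NUM_MODULES = 8
NUM_COPIES = 2
random.seed(1)
salts = [random.getrandbits(32) for _ in range(NUM_COPIES)]

def copies(address):
    """Stages (a)/(b): the modules holding the copies of this address."""
    return [hash((address, s)) % NUM_MODULES for s in salts]

def schedule(requests):
    """Stage (c): assign each request to the currently least-loaded copy,
    since only one cell per module can be accessed at a time."""
    load = [0] * NUM_MODULES
    assignment = {}
    for addr in requests:
        m = min(copies(addr), key=lambda mod: load[mod])
        load[m] += 1
        assignment[addr] = m
    return assignment, max(load)
```

The point of keeping several copies is load balancing: with two random choices per address, the worst-case module load — and hence the simulation's slowdown per parallel step — concentrates much closer to the average than with a single random placement.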
An Immediate Concurrent Execution (ICE) Abstraction Proposal for Many-Cores
Settling on a simple abstraction that programmers aim at, and hardware and software systems people enable and support, is an important step towards convergence to a robust many-core platform.
The current paper: (i) advocates incorporating a quest for the simplest possible abstraction in the debate on the future of many-core computers, (ii) suggests “immediate concurrent execution (ICE)” as a new abstraction, and (iii) argues that an XMT architecture is one possible demonstration of ICE providing an easy-to-program general-purpose many-core platform.
Parallel unit propagation: Optimal speedup 3CNF Horn SAT
A linear-work parallel algorithm for 3CNF Horn SAT is presented. This is of particular interest because the problem is P-complete.
Can Parallel Algorithms Enhance Serial Implementation?
Consider the serial emulation of a parallel algorithm. The thesis
presented in this paper is rather broad. It suggests that such a serial
emulation has the potential advantage of running on a serial machine
faster than a standard serial algorithm for the same problem.
The main concrete observation is very simple: just before the serial
emulation of a round of the parallel algorithm begins, the whole list of
memory addresses needed during this round is readily available; and, we
can start fetching all these addresses from secondary memories at this time.
This permits prefetching the data that will be needed in the next "time
window", perhaps by means of pipelining; these data will then be ready at
the fast memories when requested by the CPU. The possibility of
distributing memory addresses (or memory fetch units) at random over
memory modules, as has been proposed in the context of implementing the
parallel-random-access machine (PRAM) design space, is discussed.
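The core observation can be sketched as follows: since a round's read addresses are known before it runs, they can be staged into fast memory as a batch before the round's computation begins. The round encoding and the `fetch_batch` stand-in below are illustrative assumptions; in a real system the batched loads would overlap via pipelining rather than run one by one.

```python
# Hedged sketch of round-by-round serial emulation with batched prefetch.
# `fetch_batch` stands in for issuing overlapped secondary-memory reads.

def fetch_batch(memory, addresses, cache):
    # In a real system these loads would overlap (pipelined prefetch);
    # here we simply stage them into a fast-memory dictionary.
    for a in addresses:
        cache[a] = memory[a]

def emulate(rounds, memory):
    """Each round is (addresses_read, step), where step(cache) returns
    the writes of the round as an address -> value dictionary."""
    for addresses, step in rounds:
        cache = {}
        fetch_batch(memory, addresses, cache)    # all reads known up front
        for addr, value in step(cache).items():  # then compute the round
            memory[addr] = value

# Usage: one round that doubles two cells, then one that sums two cells.
mem = {0: 1, 1: 2, 2: 5}
emulate([([0, 1], lambda c: {a: v * 2 for a, v in c.items()}),
         ([0, 2], lambda c: {3: c[0] + c[2]})], mem)
```

The potential win is exactly the one argued above: a standard serial algorithm discovers its memory addresses one at a time, while the emulated parallel algorithm exposes each round's entire address list in advance.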
This work also suggests that a multi-stage effort to build a parallel
machine may start with "parallel memories" and serial processing,
deferring parallel processing to a later stage. The general approach has
the following advantage: a user-friendly parallel programming language
can be used already in its first stage. This is in contrast to a practice
of compromising user-friendliness of parallel computer interfaces (i.e.,
parallel programming languages), and may offer a way for alleviating a
so-called "parallel software crisis".
It is too early to reach conclusions regarding the significance of the
thesis of this paper. Preliminary experimental results with respect to
the fundamental and practical problem of constructing suffix trees
indicate that drastic improvements in running time might be possible.
Serious attempts to follow it up are needed to determine its usefulness.
Parts of this paper are intentionally written in an informal way,
suppressing issues that will have to be resolved in the context of a
concrete implementation. The intention is to stimulate debate and provoke
suggestions and other specific approaches.
Validity of our thesis would imply that a standard computer science
curriculum, which prepares young graduates for a professional career of
over forty years, will have to include the topic of parallel algorithms
irrespective of whether (or when) parallel processing will succeed serial
processing in the general purpose computing market.
(Also cross-referenced as UMIACS-TR-91-145.1)
XMTSim: A Simulator of the XMT Many-core Architecture
This paper documents the features and the design of XMTSim, the cycle-accurate simulator of the Explicit Multi-Threading
(XMT) computer architecture. XMT is a general-purpose many-core computing platform, with the vision of a
1000-core chip that is easy to program but does not compromise on performance. XMTSim is a primary
component in its publicly available toolchain along with an optimizing compiler. Research and experimentation enabled by
the toolchain played a central role in supporting the ease-of-programming and performance aspects of the XMT architecture.
The compiler and the simulator are also important milestones for an efficient programmer's workflow from PRAM algorithms
to programs that run on the shared memory XMT hardware. This workflow is a key component in accomplishing the goal of
ease-of-programming and performance.
The applicability of the XMT simulator extends beyond specific XMT choices. It can be used to explore the much greater
design space of shared memory many-cores by system researchers or by programmers. As the toolchain can practically run on
any computer, it provides a supportive environment for teaching parallel algorithmic thinking with a programming component.
National Science Foundation grant CCF-081150
Empirical Speedup Study of Truly Parallel Data Compression
We present an empirical study of novel work-optimal parallel
algorithms for Burrows-Wheeler compression and decompression
of strings over a constant alphabet. To validate
these theoretical algorithms, we implement them on the experimental
XMT computing platform developed especially
for supporting parallel algorithms at the University of Maryland.
We show speedups of up to 25x for compression, and
13x for decompression, versus bzip2, the de facto standard
implementation of Burrows-Wheeler compression. Unlike
existing approaches, which assign an entire (e.g., 900KB)
block to a processor that processes the block serially, our
approach is “truly parallel” as it processes in parallel the
entire input. Besides the theoretical interest in solving the
“right” problem, the importance of data compression speed
for small inputs even at great expense of quality (compressed
size of data) is demonstrated by the introduction of Google’s
Snappy for MapReduce. Perhaps surprisingly, we show feasibility
of holding on to quality, while even beating Snappy
on speed.
In turn, this work adds new evidence in support of the
XMT/PRAM thesis: that an XMT-like many-core hardware/
software platform may be necessary for enabling general-purpose
parallel computing. Comparison of our results to recently
published work suggests 70x improvement over what
current commercial parallel hardware can achieve.
NSF grants CCF-0811504 and CNS116185
Parallel Algorithms for Burrows-Wheeler Compression and Decompression
We present work-optimal PRAM algorithms for Burrows-Wheeler compression
and decompression of strings over a constant alphabet. For a string of
length n, the depth of the compression algorithm is O(log^2 n), and the depth
of the corresponding decompression algorithm is O(log n). These appear
to be the first polylogarithmic-time work-optimal parallel algorithms for any
standard lossless compression scheme.
The algorithms for the individual stages of compression and decompression
may also be of independent interest: 1. a novel O(log n)-time, O(n)-work
PRAM algorithm for Huffman decoding; 2. original insights into the stages of
the BW compression and decompression problems, bringing out parallelism
that was not readily apparent, allowing them to be mapped to elementary
parallel routines that have O(log n)-time, O(n)-work solutions, such as: (i)
prefix-sums problems with an appropriately-defined associative binary operator
for several stages, and (ii) list ranking for the final stage of decompression.
NSF grant CCF-081150
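The prefix-sums primitive with a general associative operator, mentioned in (i), can be sketched with the standard balanced up-sweep/down-sweep scan; within each level, all operations are independent, which is what yields O(log n) time and O(n) work on a PRAM. This is the generic textbook scan, not the paper's BW-specific operators.

```python
# Work-efficient exclusive scan (Blelloch-style) over any associative
# operator `op` with identity element `identity`; serial emulation of the
# PRAM routine. Illustrative, not the paper's BW-specific instantiation.

def exclusive_scan(values, op, identity):
    n = 1
    while n < len(values):
        n *= 2
    a = list(values) + [identity] * (n - len(values))
    # Up-sweep: the pairwise reductions at each level are independent,
    # so on a PRAM each of the log n levels takes O(1) time.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] = op(a[i - d], a[i])
        d *= 2
    # Down-sweep: push prefixes back toward the leaves.
    a[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            t = a[i - d]
            a[i - d] = a[i]          # left child inherits parent's prefix
            a[i] = op(a[i - d], t)   # right child: prefix, then left-half sum
        d //= 2
    return a[:len(values)]

assert exclusive_scan([1, 2, 3, 4], lambda x, y: x + y, 0) == [0, 1, 3, 6]
```

Because only associativity is required, the same routine supports the appropriately-defined binary operators mentioned above, including non-commutative ones such as concatenation.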